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Throughout the different phases of a drug development program, ran- 
domized trials are used to establish the tolerability, safety and efficacy of a 
candidate drug. At each stage one aims to optimize the design of future stud- 
ies by extrapolation from the available evidence at the time. This includes 
collected trial data and relevant external data. However, relevant external data 
are typically available as averages only, for example, from trials on alternative 
treatments reported in the literature. Here we report on such an example from 
a drug development for wet age-related macular degeneration. This disease 
is the leading cause of severe vision loss in the elderly. While current treat- 
ment options are efficacious, they are also a substantial burden for the patient. 
Hence, new treatments are under development which need to be compared 
against existing treatments. 

The general statistical problem this leads to is meta-analysis, which ad- 
dresses the question of how we can combine data sets collected under dif- 
ferent conditions. Bayesian methods have long been used to achieve partial 
pooling. Here we consider the challenge when the model of interest is com- 
plex (hierarchical and nonlinear) and one data set is given as raw data while 
the second data set is given as averages only. In such a situation, common 
meta-analytic methods can only be applied when the model is sufficiently 
simple for analytic approaches. When the model is too complex, for exam- 
ple, nonlinear, an analytic approach is not possible. We provide a Bayesian 
solution by using simulation to approximately reconstruct the likelihood of 
the external summary and allowing the parameters in the model to vary un- 
der the different conditions. We first evaluate our approach using fake data 
simulations and then report results for the drug development program that 
motivated this research. 


1. Introduction. Modern drug development proceeds in stages to establish 
the tolerability, safety and efficacy of a candidate drug [Sheiner (1997)]. At each 
stage and using all relevant information, it is essential to plan the next steps. The 
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collected raw data are measurements of individual patients over time. Pharma- 
cometric models of such raw data commonly use nonlinear longitudinal differ- 
ential equations with hierarchical structure (also known as population models), 
which can, for example, describe the response of patients over time under different 
treatments. Such models typically come with assumptions of model structure and 
variance components that offer considerable flexibility and allow for meaningful 
extrapolation to new trial designs. While these models can be fit to raw data, we 
often wish to consider additional data which may be available only as averages 
or aggregates. For example, published summary data of alternative treatments are 
critical for planning comparative trials. Such external data would allow for indirect 
comparisons as described in the Cochrane Handbook [Higgins and Green (2011)]. 

Methods for the mixed case of individual patient data and aggregate data are 
recognized as important but are limited in their scope so far. For example, in the 
field of pharmaco-economics, treatments need to be assessed which have never 
been compared in a head-to-head trial. Methods such as matching-adjusted indirect 
comparisons (MAIC) [Signorovitch et al. (2010)] and simulated treatment com- 
parisons (STC) [Caro and Ishak (2010), Ishak, Proskorovsky and Benedict (2015)] 
have been proposed to address the problem of mixed data in this domain. The fo- 
cus of these methods is a retrospective comparison of treatments while we seek 
a prospective comparison under varying designs. That is, in the MAIC approach 
the individual patient data is matched to the reported aggregate data using baseline 
covariates. While simple in its application, its utility is limited for a prospective 
planning of new trials which vary in design properties. The STC approach offers 
additional flexibility as it is based on the simulation of an index trial to which other 
trials are matched using predictive equations. However, the approach requires cal- 
ibration for which individual patient data is recommended. Hence, the effort of an 
STC approach is considerable, and its flexibility is still limited, since the simulated 
quantities are densities of the endpoints. In contrast, longitudinal nonlinear hierar- 
chical pharmacometric models have the ability to simulate the individual patient 
response over time and, hence, give the greatest flexibility for prospective clinical 
trial simulation, which provides valuable input to strategic decisions for a drug 
development program. 

Here we report on an example of a drug development program to investigate 
new treatment options for wet age-related macular degeneration (wetAMD); see 
Ambati and Fowler (2012), Buschini et al. (2011), Khandhadia et al. (2012), 
Kinnunen et al. (2012). This disease is the leading cause of severe vision loss in the 
elderly [Augood et al. (2006)]. Available drugs include anti-vascular endothelial 
growth factor (anti- VEGF) agents, which are repeatedly administered as direct in- 
jections into the vitreous of the eye. The first anti- VEGF agent was Ranibizumab 
[Brown et al. (2006), Rosenfeld et al. (2006)], with another, Aflibercept [Heier 
et al. (2012)], introduced several years later. Initially, anti-VEGF intravitreal in- 
jections were given monthly, and more flexible schemes with longer breaks be- 
tween dosings evolved over recent years to reduce the burden for patients and their 
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caregivers. In addition, a reduced dosing frequency also increases compliance to 
treatment, which ensures sustained long-term efficacy. 

A key requirement for any new anti- VEGF agent is an optimized dosing scheme 
to compare favorably to existing treatment options. For a prospective evaluation of 
new trials, we simulate clinical trials using nonlinear hierarchical pharmacometric 
models in which a new anti-VEGF agent is compared to available treatments with 
various design options. Important design options include the patient population 
characteristics and the dosing regimen, which specifies what dose amount is to be 
administered at which timepoints to a given patient. 

In clinical studies, visual acuity is assessed by the number of letters a patient 
can read from an Early Treatment Diabetic Retinopathy Study (ETDRS) chart, 
expressed as best corrected visual acuity (BCVA) score, where the patient is al- 
lowed to use glasses for the assessment. A nonlinear pharmacometric drug-disease 
model is able to longitudinally regress the efficacy response as a function of the 
patients’ characteristics and individual dosing history. This flexibility reduces con- 
founding (through covariates and accounting for noncompliance) during inference 
and enables realistic extrapolation to future designs with alternative dosing regi- 
mens. However, these models do require certain raw data that are commonly not 
reported in the literature. In our example, raw patient data from Ranibizumab tri- 
als were available to us, but we only had aggregate data available for Aflibercept. 
This creates the awkward situation that the reported aggregate data on Aflibercept 
cannot be used to obtain accurate model predictions despite our understanding 
that the nonlinear model is appropriate for the same patient population and we are 
moreover only interested in population predictions, that is, the interest lies in pop- 
ulation parameters and not in patient specific parameters. The problem is that the 
likelihood function for the aggregated data in general has no closed form expres- 
sion. The standard expectation—maximization or Bayesian approach in this case 
is to consider the unavailable individual data points as missing data, but this can 
be computationally prohibitive as it will vastly increase the dimensionality of the 
problem space in an experiment with hundreds of patients and multiple measure- 
ments per patient. 

This paper describes how we enabled accurate clinical trial simulations to in- 
form the design of future studies in wetAMD, which aim at improving the dosing 
regimens of anti-VEGF agents. This led us to develop a novel statistical computa- 
tional approach for integrating averaged data from an external source into a linear 
or nonlinear hierarchical Bayesian analysis. The key point is that we use an approx- 
imate likelihood of the external average data instead of using an approximate prior 
derived from the external data. Doing so enables coherent joint Bayesian inference 
of raw and summary data. The approach takes account of possible differences in 
the model in the two data sets. 

In Section 2, we describe the data and model for our study, and Section 3 
lays out our novel approach for including aggregate data into the pharmacomet- 
ric model. Section 4 demonstrates our approach using simulation studies of a lin- 
ear and a nonlinear example. In the linear example we compare our approach to 
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an exact analytic reference, the nonlinear case is constructed to be similar in its 
properties to the actual pharmacometric model. We present results for our main 
problem in Section 5 and conclude with a discussion in Section 6. Source code of 
R and Stan programs of simulation studies and drug-disease model can be found 
in the Supplementary Material [Weber et al. (2018)]. 


2. Data and pharmacometric model. 


2.1. Study data. We included in the analysis data set the raw data from the 
studies MARINA, ANCHOR and EXCITE [Rosenfeld et al. (2006), Brown et al. 
(2006), Schmidt-Erfurth et al. (2011)]. In MARINA and ANCHOR, a monthly 
(q4w) treatment with Ranibizumab was compared to placebo and active control, 
respectively. In MARINA, a high and a low dose regimen treatment arm with 
Ranibizumab were included in the trial. The EXCITE study tested the feasibil- 
ity of an alternative dosing regimen with longer, three months (q12w), treatment 
intervals after an initial three month loading phase of monthly treatments with 
Ranibizumab. We restricted our analysis to the efficacy data only for up to one year 
which is the follow-up time for the primary endpoints of these studies. We consider 
the reported BCVA measure of the number of letters read from the ETDRS chart 
which contains 0-100 letters. 

For Aflibercept no raw data from patients are available in the public domain; 
only literature data of reported mean responses are available from the VIEW 1 and 
VIEW2 studies [Heier et al. (2012)]. These studies assessed noninferiority of a 
low/high dose q4w and an eight week (q8w) dosing regimen with Aflibercept in 
comparison to 0.5 mg q4w Ranibizumab treatment, which was also included in 
these studies as reference arm. Figure 1 shows the reported mean BCVA data of 
VIEW 1-+2. In Table 1, we list the baseline characteristics for all the included study 
arms in the analysis. 


2.2. Pharmacometric model. We use a drug-disease model, which is informed 
on the basis of raw measurements of individual patients over time. Such a model 
[Weber et al. (2014)] was developed on the available raw data for Ranibizumab 
using the studies MARINA, ANCHOR and EXCITE. The visual acuity measure 
(BCVA) is limited to the range of 0-100 (letters read from the ETDRS chart), so, 
we modeled it on a logit transformed scale, Rj(t) = logit(y;,/100), where y jx is 
the measurement for patient j at time t = x,;. The drug-disease model used was 
derived from the semimechanistic turnover model [Jusko and Ko (1994)], which 
links a drug concentration, C;(t), with a pharmacodynamic response, Rj; (t). The 
drug concentration, C ;(t), is determined by the dose amount and dosing frequency 
as defined by the regimen. In our case the drug concentration, C;(f), is latent, since 
no measurements of C ;(t) in the eye of a patient is possible for ethical and practical 
reasons. Therefore, we used a simple mono exponential elimination model and 
fixed the vitreous volume to 4mL [Hart (1992)] and the elimination half-life 1/2 
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Fic. 1. Published average data of the VIEW1+2 studies [Heier et al. (2012)]. Shown is the re- 
ported mean baseline change best-corrected visual acuity (BCVA) over a time period of one year. 
The vertical line at the last time point marks one standard error of the reported mean. 


TABLE | 
Baseline data of trials included in the analysis. The reported baseline BCVA and age are the 
respective mean values and their standard deviations 


Dose BCVA (SD) Age (SD) 
Study Data Compound N Freq. [mg] [letter] Ly] 
MARINA patient Ranibizumab 238 Q4w 0.3 53.1 (12.9) 774 (7.6) 
MARINA patient Ranibizumab 239 Q4w 0.5 53.7 (12.8) 76.8 (7.6) 
MARINA patient Placebo 236 Q4w sham 53.9 (13.7) 77.1 (6.6) 
ANCHOR patient Ranibizumab 137 Q4w 0.3 47.1 (12.8) 77.3 (7.3) 
ANCHOR patient Ranibizumab 139 Q4w 0.5 47.1 (13.2) 75.9 (8.5) 
EXCITE patient Ranibizumab 120 Ql2w 0.3 55.8 (11.8) 75.1 (7.5) 
EXCITE patient Ranibizumab 118 Ql2w 0.5 57.7 (13.1) 75.8 (7.0) 
EXCITE patient Ranibizumab 115 Q4w 0.3 56.5 (12.2) 75.0 (8.3) 
VIEW 1 average Aflibercept 301 Q4w 0.5 55.6 (13.1) 78.4 (8.1) 
VIEW1 average Aflibercept 304 Q4w 2.0 55.2 (13.2) 77.7 (7.9) 
VIEW 1 average Aflibercept 301 Q8w 2.0 55.7 (12.8) 77.9 (8.4) 
VIEW 1 average Ranibizumab 304 Q4w 0.5 54.0 (13.4) 78.2 (7.6) 
VIEW2 average Aflibercept 296 Q4w 0.5 51.6 (14.2) 74.6 (8.6) 
VIEW2 average Aflibercept 309 Q4w 2.0 52.8 (13.9) 74.1 (8.5) 
VIEW2 average Aflibercept 306 Q8w 2.0 51.6 (13.9) 73.8 (8.6) 
VIEW2 average Ranibizumab 291 Q4w 0.5 53.8 (13.5) 73.0 (9.0) 
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from the vitreous to nine days [Xu et al. (2013)]. The standard turnover model 
assumes that the response Rj(t) can only take positive values, which is not given 
on the logit transformed scale. A modified turnover model is therefore used, which 
is defined by the ordinary differential equation (ODE) 

dR j (t) 

dt 

The drug effect enters this equation via the function $;, which is typically chosen 
to be a Hill function of the concentration C;(t). The Hill function is a logistic 
function of the log drug concentration, logit~! (log EC50—log C j(t)). At baseline, 
Rj(t = 0) = Ro; defines the initial condition for the ODE. The model in equation 
(1) has an important limit whenever a time constant stimulation, S(t) = s;, is 
applied. Then, the ODE system drives Rj (t) towards its stable steady-state, which 
is derived from equation (1) by setting the left-hand side to 0, Re = (kin / kot) + 
Emax j5j- In absence of a drug treatment, no stimulation 1s present; that is, S(t) = 


() 


=F KI" Ri(O) = Emaxj5j(Cj)]. 


s; =0, hence, the ratio kin / Kou is of particular importance, as for placebo patients 
it holds that lim;-,o9 Rj(t) = kin / kent, The drug-disease model describes treated 
patients in relation to placebo patients and separates the drug-related parameters 
(t1/2, Emax and EC50) from the remaining nondrug-related parameters. 


3. Bayesian aggregation of average data. 


3.1. General formulation. We shall work in a hierarchical Bayesian frame- 
work. Suppose we have data y = (yjx; j=1,...,J;k =1,...,7T) on J individ- 
uals at T time points, where each y; = (yj1,-..-, yjr) is a vector of data with 
model p(y;|a;,@). Here, each a; is a vector of parameters for individual j, and 
@ is a vector of shared parameters and hyperparameters so that the joint prior is 
p(a, ¢) = p(@) ie p(aj|@), and the primary goal of the analysis is inference 
for the parameter vector @. 

We assume that we can use an existing computer program such as Stan [Stan 
Development Team (2017)] to draw simulations from the posterior distribution, 
pa, bly) « pb) Tay PC@jl) jar POjlej, 9). 

We then want to update our inference using an external data set, y’ = (y} ni 
1,...,J/;k=1,...,7’), on J’ individuals at T’ time points, assumed to be gen- 
erated under the model, PY} la’, o’). There are two complications: 


e The external data, y’, are modeled using a process with parameters ¢’ that are 
similar to but not identical to those of the original data. We shall express our 
model in terms of the difference between the two parameter vectors, 6 = ¢’ — ¢. 
We assume the prior distribution factorizes as p(@, 5) = p(@) p(s). 

We assume that all the differences between the two studies, and the popula- 
tions which they represent, are captured in 5. One could think of ¢ and ¢’ as two 
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instances from a population of studies. If we were to combine data from several 
external trials, it would make sense to include between trial variation using an 
additional set of hyperparameters in the hierarchical model. 

e We do not measure y’ directly; instead, we observe the time series of averages, 
y’ = (y},---» ¥p). And, because of nonlinearity in the data model, we cannot 
simply write the model for the external average data, p(y’|a’, d’), in closed 
form. 


This is a problem of meta-analysis, for which there is a longstanding concern when 
the different pieces of information to be combined come from different sources or 
are reported in different ways [see, e.g., Higgins and Whitehead (1996), Dominici 
et al. (1999)]. 

The two data issues listed above lead to corresponding statistical difficulties: 


e If the parameters ¢’ of the external data were completely unrelated to the pa- 
rameters of interest, ¢—that is, if we had a noninformative prior distribution 
on their difference, 5—then there would be no gain from including the external 
data into the model, assuming the goal is to learn about @. 

Conversely, if the two parameter vectors were identical, so that 6 = 0, then we 
could just pool the two data sets. The difficulty arises because the information 
is partially shared, to an extent governed by the prior distribution on 6. 

e Given that we see only averages of the external data, the conceptually simplest 
way to proceed would be to consider the individual measurements y; ; a8 miss- 
ing data, and to perform Bayesian inference jointly on all unknowns, obtaining 
draws from the posterior distribution, p(@, 6, a, a’|y, y’). The difficulty here 
is computational. Every missing data point adds to the dimensionality of the 
joint posterior distribution, and the missing data can be poorly identified from 
the model and the average data. Weak data in a nonlinear model can lead to a 
poorly regularized posterior distribution that is hard to sample from. 


As noted, we resolve the first difficulty using an informative prior distribution 
on 6. Specifically, we consider in the following that not all components of ¢, but 
only a few components, differ between the data sets, such that the dimensionality 
of 6 may be smaller than that of ¢. This imposes that some components of 6 are 
exactly 0. 

We resolve the second difficulty via a normal approximation and take advan- 
tage of the fact that our observed data summaries are averages. That is, as we 
cannot construct the patient specific likelihood contribution for the external data 
set, 4 PY} |a’., ¢’), directly, instead we approximate this term by a multivari- 


ate normal, N(y’ |M,, > s) to be introduced below. 


3.2. Inclusion of summary data into the likelihood. Our basic idea is to ap- 
proximate the probability model for the external average data, p(y'|¢’), by a mul- 
tivariate normal with parameters depending on y’. For a linear model this is the 
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analytically exact representation of the average data in the likelihood. For non- 
linear models the approximation is justified by the central limit theorem if the 
summary is an average over many data points. This corresponds in essence to a 
Laplace approximation to the marginalization integral over the unobserved (latent) 
individuals in the external data set y’ as p(y'|¢’) = f p(y’|a’, ’) da’. 

The existing model on y is augmented by including a suitably chosen prior on 
the parameter vector 6 and the log-likelihood contribution implied by the exter- 
nal average data y’. As such, the marginalization integral must be evaluated in 
each iteration s of the MCMC run. Evaluating the Laplace approximation requires 
the mode and the Hessian at the mode of the integrand. Both are unavailable in 
commonly used MCMC software, including Stan. To overcome these computa- 
tional issues, we instead use simulated plug-in estimates. In each iteration s of the 
MCMC run we calculate the Laplace approximation of the marginalization integral 
as follows: 


1. Compute $/ = ¢; + 4s. 

2. Simulate parameters a; and then data yj, j=1,...,J,k= een ae 
for some number J of hypothetical new individuals, drawn from the distribution 
p(y’ |@{) and corresponding to the conditions under which the external data were 
collected (hence, the use of the same number of time points T’). The J individuals 
do not correspond to the J’ individuals in the external data set; rather, we simulate 
them only for the purpose of approximating the likelihood of the external average 
data, y’, under these conditions. The choice of J must be sufficiently large, as is 
discussed below. 

3. Compute the mean vector and the T’ x T’ covariance matrix of the simulated 
data y. Call these M, and &,. 

4. Divide the covariance matrix ¥, by J’ to get the simulation estimated co- 
variance matrix for y’, which is an average over J’ individuals whose data are 
modeled as independent conditional on the parameter vector ¢’. 

5. Approximate the marginalization integral over the individuals in the external 
y’ data set with the probability density of the observed mean vector of the T’ 
external data points using the multivariate normal distribution with mean M, and 
covariance matrix Fr Us, which are the plug-in estimates for the mode and the 
Hessian at the mode of the Laplace approximation. The density N(y’|Ms, 2) 
then represents the information from the external mean data. 


3.3. Computational issues—Tuning and convergence. For the simulation of 
the J hypothetical new individuals we do need random numbers which are inde- 
pendent of the model. However, as Bayesian inference results in a joint probability 
density, we cannot simply declare an extra set of parameters in our model during 
an MCMC run. That is, we can only control for the prior density of these extra 
parameters but not so for the posterior density, which is generated by the sam- 
pler. This is an issue, as by construction of Hamiltonian Monte Carlo (HMC), as 


BAYESIAN AGGREGATION OF AVERAGE DATA 1591 


used in Stan, no random numbers can be drawn independently from the model dur- 
ing sampling. However, our algorithm does not require that the random numbers 
change from iteration to iteration. Hence, we can simply draw a sufficient amount 
of random numbers per chain and include these as data for a given chain. As con- 
sequence, different chains may converge to different distributions due to different 
initial sets of random numbers. However, with increasing simulation size ni , the 
simulations have a decreasing variability in their estimates, as the standard error 
scales with J~!/?. Therefore, the tuning parameter J must be chosen sufficiently 
large to ensure convergence of all chains to the same result. This occurs once the 
standard error is decreased below the overall MC error. Whenever J is chosen too 
small, standard diagnostics like R [Gelman et al. (2014)] will indicate nonconver- 
gence. We assess this by running each odd chain with J and each even chain with 
2J hypothetical new individuals (typically we run four parallel MCMC chains as 
this is free on a four processor laptop or desktop computer). The calculation of R 
then considers chains with different J , and, so, a too low J will immediately be 
detected, in which case the user can increase J. 

For models with a Gaussian residual error model, Step 2 above can be simpli- 
fied. Instead of simulating observed fake data ¥, it suffices to simulate the averages 
of the hypothetical new individuals J at the 7’ time points. The residual error term 
can be added to the variance—covariance matrix ©, as diagonal matrix. Should 
the sampling model not be normal, then normal approximations should be consid- 
ered to use. The benefit is a much reduced simulation cost in each iteration of the 
MCMC tun. 


4. Simulation studies. 


4.1. Hierarchical linear regression. We begin with a fake data hierarchical 
linear regression example, which is simple enough that we can compare our ap- 
proximate inferences to a closed form analytic solution to the problem as the un- 
observed raw data can be marginalized over in a full analytic approach. We set 
up this example to correspond in its properties to the longitudinal nonlinear drug- 
disease model. 

We consider a linear regression using a continuous covariate x (correspond- 
ing to time) with an intercept, a linear, and a quadratic slope term. The inter- 
cept and linear slope term vary in two ways which is by individual and data 
set. The quadratic term does not vary by individual or data set. This allows us 
to check two aspects: (a) if we can learn differences between data sets (intercept 
and slope) and (b) if the precision on fully shared parameters (quadratic term) 
increases when combining data sets. That is, for the main data set y, the model 
is yjk ~ N(aji + @j2x~ + Bxz,02), with prior distribution a; ~ N(ua, Zq) for 
which we set the correlations Pajiaj2 (the off-diagonal elements of X,) to 0. 
Using the notation from Section 3.1, the vector of shared parameters ¢ is ¢ = 
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(Mal; Ma2, B, Cv1, G2, Cy). We assume that the number of individuals J is large 
enough such that we can assign a noninformative prior to @. 

For the external data set y’, the model is Vix ~ N@, + ol! Xk + Bxz, oS), 
with the prior distribution a’. ~N(u/,, Za). In this simple example, we assign a 
noninformative prior distribution to 6 = /, — a while we assign a 6 of exactly 0 
to all other components in @ such that #’ = (a1 + 51, a2 + 52, Bs Ow1, Fa2; Oy). 


Assumed parameter values. We create simulations assuming the following 
conditions, which we set to roughly correspond to the features of the drug-disease 
model: 


e J = 100 individuals in the original data set, each measured T = 13 times (cor- 


responding to measurements once per month for a year), x, = 0, or Hagel, 
e J’ = 100 individuals in the external data set, also measured at these 13 time 
points. 


© (Leal, %a1) = (0.5, 0.1), corresponding to intercepts that are mostly between 0.4 

and 0.6. The data from our actual experiment roughly fell on a 100-point scale, 

which we are rescaling to O—1 following the general principle in Bayesian anal- 

ysis to put data and parameters on a unit scale [Gelman (2004)]. 

(a2; Cy2) = (—0.2, 0.1), corresponding to an expected loss of between 10 and 

30 points on the 100-point scale for most people during the year of the trial. 

© Pxj\a;. = 9, no correlation assumed between individual slopes and intercepts. 

e £ = —0.1, corresponding to an accelerating decline representing an additional 
drop of 10 points over the one-year period. 

e oy =0.05, indicating a measurement and modeling error on any observation of 
about five points on the original scale of the data. 


Finally, we set 5 to (0.1, 0.1), which represents a large difference between the two 
data set in the context of this problem and allows us to test how well the method 
works when the shift in parameters needs to be discovered from data. 

In our inferences, we assign independent unit normal priors for all the parame- 
ters Let, a2, 6, 61, and 62; and independent half unit normal priors to the variance 
components 0y1, G2, and oy. Given the scale of the problem (so that parameters 
should typically be less than one in absolute value, although this is not a hard 
constraint), the unit normals represent weak prior information which just serves to 
keep the inferences within generally reasonable bounds. 


Conditions of the simulations. We run four chains using the default sampler 
in Stan, the HMC variant No-U-Turn Sampler (NUTS) [Hoffman and Gelman 
(2014), Betancourt (2016)], and set J to 500, so that every odd chain will sim- 
ulate 500 and every even 1000 hypothetical individuals, thus allowing us to easily 
check if the number of internal simulations is enough for stable inference. If there 
were notable differences between the inferences from even and odd chains, this 
would suggest that J = 500 is not enough and should be increased. 
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Computation and results. We simulate data y and y’. For simplicity we do 
our computations just once in order to focus on our method only. If we wanted to 
evaluate the statistical properties of the examples, we could nest all this in a larger 
simulation study. 

We first evaluate the simulation based approximation of the log-likelihood con- 
tribution of the mean data y’. This is shown in the top panel of Figure 2. The plot 
shows log p(y’|¢’) evaluated at the true value of ¢’ for varying values of 52. The 
gray band marks the 80% confidence interval of 10° replicates when simulating 
per replicate a randomly chosen set of J = 10? patients. The dotted blue line is the 
median of these simulations and the black solid line is the analytically computed 
expression for log p(y’|¢’), which we can compute for this simple model directly. 
Both lines match respectively, which suggests that the simulation approximation is 
consistent with the analytical result. The width of the gray band is determined by 
the number of hypothetical fake patients J. The inset plot shows at a fixed value 
of 52 = 0 the width of the 80% confidence interval as a function of J ina log-log 
plot. The solid black line marks the simulation results while the dashed line has a 
fixed slope of —1/2 and a least-squares estimated intercept. As both lines match 
each other, we can conclude that the scaling of the confidence interval width is 
consistent with « J—!/?. 

We run the algorithm as described below and reach approximate convergence 
in that the diagnostic R is near 1 for all the parameters in the model. We then 
compare the inferences for the different scenarios: 


local: The posterior estimates for the shared parameters @ using just the model fit 
to the local data y. 

full: The estimates for all the parameters ¢, 5 using the complete data y, y’, which 
would not in general be available—from the statement of the problem we see 
only the averages for the new data y’—but we can do so here as we have simu- 
lated data. 

approximate: The estimates when using the approximation scheme for all the 
parameters ¢, 5 using the actual available data y, y’. 

integrated: The estimates when using an analytical likelihood for all of the pa- 
rameters @, 6 using the actual available data y, y’. In general, it would not be 
possible to compute this inference directly, as we need the probability density 
for the averaged data, but in this linear model this distribution has a closed-form 
expression which we can calculate. 


The bottom panel of Figure 2 shows the results of the parameter estimates as bias. 
We are using informative priors and so we neither desire nor expect expect a bias 
of exactly 0. Rather we would like to see for each parameter a match of the approx- 
imate estimate (blue line with a square) with the estimate of the full scenario (or- 
ange line with a triangle), which corresponds to the correct Bayes estimate. How- 
ever, we cannot expect that the full scenario matches the approximate estimate, 
since the correct Bayes estimate for the full scenario is given by p(@, 5|y, y’), 
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FiG. 2. Hierarchical linear model example. (Top) Comparison of the analytical expression for 
log p(y'|¢"), shown as a solid black line, to the simulation based multivariate normal approximation 
NG” IMs, qr Xs). The simulation includes J = 102 hypothetical individuals, and 10° replicates were 
performed to assess its distribution. The gray area marks the 80% confidence interval and the dotted 
blue line is the median of the simulations. The inset shows the width of the 80% confidence interval 
at 8) =0 as a function of the simulation size J on a log-log scale. The dotted line has a fixed slope of 
—1/2 and the intercept was estimated using least squares. (Bottom) The model estimates are shown 
as bias for the four different scenarios as discussed in the text. Lines show the 95% credible intervals 
of the bias and the center point marks the median bias. The MCMC standard error of the mean is for 
all quantities below 10-4, 


which is based on the individual raw data y and y’ instead of y and mean data y’. 
The appropriate comparison is with reference to the integrated scenario (red line 
with a cross), which is the correct Bayes estimate of p(@, 5|y, y’). The integrated 
and the approximate scenarios do match closely for all parameters. 
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When comparing the full scenario with the approximate and integrated result, 
one can observe that the variance components og; and 0,2 are estimated with 
higher precision in the full scenario. This is a direct consequence of using the 
reported means only for the external data. 

Including the averaged data y’ into the model does not inform the variance com- 
ponents dy; and oy2, but it does increase the precision of all other parameters 
in @. This can be observed by considering the reduced width of the credible in- 
tervals when comparing the local scenario (green line with a dot) to the others, in 
particular for 4¢2 and B. The estimates of 5, and 452 are similar across all cases— 
whenever these can be estimated. This suggests that the external averaged data y’ 
are just as informative for the 5 vector as the individual raw data y’ themselves. 
The main reason as to why the precision of the 6 estimate is a little higher for 
the full scenario is related to the estimates of the variance components og; and 
O~2. These variance components are estimated from the complete individual raw 
data (y and y’) to be smaller in comparison to the other scenarios. As a result the 
overall weight of each patient to the log-likelihood is larger. This leads to a higher 
precision of the population parameters which can be observed in particular for the 
parameters (iq; and 6. 


4.2. Hierarchical nonlinear pharmacometric model. Next, we perform a fake 
data study that is closely adapted to our application of interest. The function 
R(t) in equation (1) is only implicitly defined; no closed form solution is avail- 
able for the general case. For the simulation study we consider the special case 
of constant maximal drug effect at all times; that is, S;(¢) =s; = 1 for a pa- 
tient 7 who receives treatment or S;(t) = s; =0 for placebo patients otherwise. 
The advantage of this choice is that the ODE can then be solved analytically as 
Ri) = Re + (Roj — R*) exp (=k). In the following we consider three dif- 
ferent cohorts of patients (placebo, treatment 1 and treatment 2) observed at times 
t = xx. Data for treatment 2 will be considered as the external data set and given 
as average data only to evaluate our approach. Measurements yj, of a patient j 
are assumed to be i.i.d. normal, yj;,/100 ~ N(ogit™!(R; (Xk)), ony We assume 
that the number of patients is large enough such that weakly informative priors, 
which identify the scale of the parameters, are sufficient. The above quantities are 
parametrized and assigned the simulated true values and priors for inference as: 


e J = 100 patients in the data set with raw measurements per individual patient. 
The first 7 = 1,...,50 patients are assigned a placebo treatment (Eymax ; = 0) 
and the remaining j = 51,...,100 patients are assigned a treatment with 
nonzero drug effect (Emax; > 0). All patients are measured at T = 13 time 
points corresponding to one measurement per month over a year. We rescale 
time accordingly to x, = 0, > wes lls 

e J’ =100 patients in the external data set, measured at the same 7’ = 13 time 
points. 
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Roj ~ N(Lao, Ole) is the unobserved baseline value of each patient j on the 
logit scale which we set to Lag = 0 corresponding to 50 on the original scale 
and Oya = 0.2. We set the weakly informative prior to Lag ~ N(O, 27) and 
SLa, ~ N*(0, 17). 

e kin / kout = Lats is the placebo steady state, the asymptotic value patients reach 
if not on treatment (or treatment is stopped). In the example, lower values of the 
response correspond to worse outcome. We set the simulated values to La, = 
logit(35/100) and the prior to La; ~ N(—1, 2), 

log(1/ kent) ~ Nik, cra determines the patient-specific time scale of the ex- 


ponential changes (k°"' is a rate of change). We assume that changes in the 
response happen within 10/52 time units, which led us to set /k = log(10/52) 
and we defined as a prior /k ~ N(log(1/4), log(2)) and ox ~ N+(0, 1”). 
log(Emax;) is the drug effect for patient j. If patient j is in the placebo 
group, then Emax; = 0. For patients receiving the treatment | drug we as- 
sumed log( Emax j) =/ Emax ; = log (logit(60/100) — logit(35/100)), which rep- 
resents a gain of 25 points in comparison to placebo. Patients in the exter- 
nal data set y’ are assumed to have received the treatment 2 drug and are as- 
signed a different /E/,,,. We consider 6 = /E/,,, — 1Emax = 0.1, which cor- 
responds to a moderate to large difference [exp(0.1) ~ 1.1]. As priors we use 
lEmax ~ N(log(0.5), log(2)*) and 6 ~ N(O, 1”). 

e oy =0.05 is the residual measurement error on the original letter scale divided 

by 100. The prior is assumed to be oy ~ NT (0, i), 


All simulation results are shown in Figure 3. In the top panel of Figure 3 an 
assessment of the sampling distribution of our approximation is shown for a sim- 
ulation size of J = 10* hypothetical fake patients and 10° replicates. Since for 
this nonlinear example we cannot integrate out analytically the missing data in the 
external data set such that there is no black reference line as before. However, we 
can conclude that the qualitative behavior of a maximum around the simulated true 
value is like that in the linear case. Moreover, the inset confirms that the scaling of 
the precision of the approximation with increasing simulation size J of hypotheti- 
cal fake patients scales as a power law consistent with « J~!/?. 

For the model we run four chains and set J to 500 as before. The model es- 
timates are shown as bias in the bottom panel of Figure 3. The precision of the 
estimates from the local fit (green line with a dot) increases when adding the ex- 
ternal data. While population mean parameters gain in precision in the full (orange 
line with a triangle) and approximate (blue line with a square) scenarios, the pre- 
cision of variance component parameters like o7q, and oj, only increase in the 
full scenario. This is expected as the mean data y’ does not convey information 
on between-subject variation. However, it is remarkable that the population mean 
parameter estimates for the approximate scenario are almost identical to the full 
scenario, including the important parameter 61. 
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FIG. 3. Drug-disease model example: (Top) Assessment of the distribution of the multi-variate 
normal approximation to log p(y'|¢’) at a simulation size of J =102 hypothetical fake patients using 
103 replicates for varying 51. The gray area marks the 80% confidence interval, the blue dotted line 
is the median of the simulations. The inset shows the width of the 80% confidence interval at 5, =0 
as a function of the simulation size J on a log-log scale. The dotted line has a fixed slope of —1/2 
and the intercept was estimated using least squares. (Bottom) The model estimates are shown as bias 
for the three different scenarios as discussed in the text. Lines show the 95% credible intervals of the 
bias and the center point marks the median bias. The MCMC standard error of the mean is for all 
quantities below ig, 


We can conclude that possible differences in a drug-related parameter, 5], can 
equally be identified from individual raw data as from the external mean data only. 
The mean estimate for 6; and its 95% credible interval in the full scenario (y, y’) 
and the approximate scenario (y, y’) do match one another closely. 
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5. Results for the drug development application. We now turn to the ap- 
plication of our approach for the development of a new drug for wetAMD. For 
Aflibercept no raw data from patients is available in the public domain; only liter- 
ature data of reported mean responses are available [Heier et al. (2012)]. Hence, 
extrapolation for Aflibercept treatments on the basis of the developed drug-disease 
model was not possible. The drug-related parameters of the drug-disease model 
are the elimination half-life rt, /2, the maximal drug effect, / Emax and the concen- 
tration at which 50% of the drug effect is reached, /EC50 (both parameters are 
estimated on the log scale). The elimination half-life is fixed with a drug specific 
value in our model from values reported in the literature for each drug. We can in- 
form the latter two parameters for Ranibizumab from our raw data, which comprise 
a total of N = 1342 patients from the studies MARINA, EXCITE and ANCHOR; 
the data from the VIEW1+-2 studies [Heier et al. (2012), N = 1210 + 1202] en- 
ables us to estimate these parameters for Aflibercept. Following our approach, we 
modified the existing model on Ranibizumab to include a 6 parameter [with a 
weakly informative prior of N(O, 1)] for each of the drug-related parameters for 
patients on Aflibercept treatment. In addition, we also allowed the baseline BCVA 
of VIEW1+2 to differ as compared to the chosen reference study MARINA. We 
did not include a 6 parameter for any other parameter in the model, since the re- 
maining parameters characterize the natural disease progression in absence of any 
drug. We consider it reasonable to assume that the natural disease progression does 
not change under the two conditions, and in any case it is impossible to infer dif- 
ferences in the natural disease progression as compared to our data set with the 
VIEW 1-2 data since no placebo patients were included in either study for ethical 
reasons. 

It is important to note that the VIEW1-+2 studies included each a 0.5 mg q4w 
treatment arm with Ranibizumab. For these arms only the mean data is reported as 
well, and we include these into our model as a reference—assuming that the drug 
specific parameters are exactly the same for all data sets. 

Figure 1 shows the published mean baseline change BCVA data of the 
VIEW1+2 studies. From the VIEW1+2 studies we choose to include only the 
mean BCVA data of the dosing regimens 2 mg q8w Aflibercept and 0.5 mg q4w 
Ranibizumab into our model, as these are used in clinical practice and are hence 
of greatest interest to describe these as accurately as possible. The total data set 
then included raw data from N = 1342 patients from MARINA, ANCHOR and 
EXCITE (different Ranibizumab regimens and a placebo arm) and N = 1202 pa- 
tients from the reported mean data in VIEW 1+2 (2 mg q8w Aflibercept and 0.5 mg 
q4w Ranibizumab). Since our model is formulated on the scale of the nominally 
observed BCVA measurements, we shifted the reported baseline change BCVA 
values by the per study mean baseline BCVA value. We used the remaining data 
from the 2 mg q4w and 0.5 mg q4w Aflibercept regimens for an out-of-sample 
model qualification. 
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Fic. 4. Main analysis results: (Left) Shows the posterior 95% credible intervals of the estimated 5 
parameters. The dotted lines mark the 95% credible interval of the prior. (Right) Shows the predicted 
mean baseline change BCVA as solid line for the study arms included in the model fit. The gray area 
marks one standard error for the predicted mean, assuming a sample size as reported per arm (about 
300 each, see Table 1). The dots mark the reported mean baseline change BCVA and are shown as 
reference. 


The final result of the fitted model, which uses our internal patient-level data, 
and the VIEW1+2 summary data of the 2 mg q8w Aflibercept and 0.5 mg q4w 
Ranibizumab arms, are shown in Figure 4. Presented are the posteriors of the 6 
parameters (left) and the posterior predictive of the mean baseline change BCVA 
response of the two included regimens of VIEW1+2 (right). 

The posterior predictive distribution of the mean baseline change BCVA is in 
excellent agreement with the reported data for the 2 mg q8w Aflibercept arms of 
VIEW1-+2. The posterior predictive distribution of the 0.5 mg q4w Ranibizumab 
mean data in VIEW1+2 suggests a slight underprediction from the model. How- 
ever, the prediction is for one standard error corresponding to a 68% credible in- 
terval, and, hence, the observed data is well in the usual 95% credible interval. 

When comparing the posteriors of the 6 parameters to their standard normal 
priors (corresponding to a prior 95% credible interval from —1.96 to +1.96), we 
observe that the information implied by the aggregate data of VIEW1+2 for each 
parameter varies substantially. While the 6;£max parameter is estimated with great 
precision to be close to 0, the precision of the 4;ec50 posterior is only increased 
slightly from a prior standard deviation of 1 to a posterior standard deviation of 0.8. 
This is a consequence of the dosing regimens in VIEW1+2, which keep patients 
at drug concentrations well above the /EC50 in order to ensure maximal drug 
effect at all times. In fact, the only trial in our Ranibizumab database where con- 
centrations vary around the range of the /E.C50 is the EXCITE study. This study 
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FiG.5. Out-of-sample model qualification: Shown is the predicted mean baseline change BCVA as 
solid line for the study arms of VIEW1+2 which were not included in the model fitting. The gray area 
marks one standard error for the predicted mean assuming a sample size as reported per arm (about 
300 each, see Table 1). The dots mark the reported mean baseline change BCVA and are shown as 
reference. 


included two q12w Ranibizumab arms which showed a decrease of the BCVA af- 
ter the loading phase such that drug concentrations have apparently fallen below 
the /EC50 which makes its estimation possible; see Schmidt-Erfurth et al. (2011). 

The out-of-sample model qualifications are shown in Figure 5. The 2 mg q4w 
Aflibercept of VIEW2 arm is well predicted by the model, while the respective 
regimen in VIEW1 is predicted less successfully. This arm was reported to have 
an unusually high mean baseline change BCVA outcome for reasons which are 
still not well understood such that we did not investigate further. Moreover, the 
regimen 0.5 mg q4w Aflibercept appears to be under predicted in VIEW2 and 
slightly over predicted in VIEW 1. However, when considering that VIEW1-+2 are 
exactly replicated trials, the observed differences in this arm (see Figure 4) are not 
expected (also note that the ordering for each regimen reversed when comparing 
these in VIEW 1 and VIEW2). If we were to compare our model predictions against 
an averaged result from VIEW 1+-2, these comparisons would look more favorable 
as the study differences would average out. We conclude that the average outcomes 
are well captured while the per arm variations are within limits which are known 
and still unexplained. 

In summary, our final model is able to predict accurately the 2 mg q8w Afliber- 
cept regimen which is our main focus when including the VIEW1-+-2 data into our 
analysis. The 2 mg q8w Aflibercept regimen is one of the treatments for wetAMD 
applied in clinical practice. 
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6. Discussion. Model-based drug development hinges on the amount of infor- 
mation which we can include into models. While hierarchical patient-level nonlin- 
ear models offer the greatest flexibility, they make raw patient-level data a require- 
ment. This can limit the utility of such models considerably, as relevant informa- 
tion may only be available to the analyst in aggregate form from the literature. For 
our wetAMD drug development program the presented approach enabled patient- 
level clinical trial simulations for most wetAMD treatments used in the clinic. 
Our approach was used to plan confirmatory trials which test a new treatment reg- 
imen with less frequent dosing patterns against currently established regimens. 
In particular, these results were used to plan the confirmatory studies HARRIER 
and HAWK, which evaluate Brolucizumab in comparison to Aflibercept. These 
trials test a new and never observed dosing regimen aiming at a reduced dosing 
frequency while maintaining maximal efficacy. Within this regimen patients are 
assessed for their individual treatment needs during a q12w-learning cycle. De- 
pending on this assessment, patients are allocated to a q12w or a q8w schedule. 
A key outcome of the trials is the proportion of patients allocated to the q12w reg- 
imen. Through the use of our approach it was possible to include highly relevant 
information from the literature into a predictive model which supported strategic 
decision making for the drug development program in wetAMD. 

The critical step in our analysis was to model jointly our study data and exter- 
nal aggregate data. We constructed a novel Bayesian aggregation of average data 
which had to overcome three different issues: 


1. Our new data were in aggregated average form; the raw data y’ were not 
available, and we could not directly write or compute the likelihood for the ob- 
served average data y’. 

2. The new data were conducted under different experimental conditions. This 
is a standard problem in statistics and can be handled using hierarchical modeling, 
but here the number of “groups” is only two (the old data and the new data), so 
it would not be possible to simply fit a hierarchical model estimating group-level 
variation from data. 

3. It was already possible to fit the model to the original data y, hence, it made 
sense to construct a computational procedure that made use of this existing fit. 


We handled the first issue using the central limit theorem (CLT), which was 
justified by the large sample size of the external data. This allowed us to approx- 
imate the sampling distribution of the average data by a multivariate normal and 
using simulation to compute the mean and covariance of this distribution, for any 
specified values of the model parameters. 

We handled the second issue by introducing a parameter 6 governing the dif- 
ference between the two experimental conditions. In some settings it would make 
sense to assign a weak prior on 6 and essentially allow the data to estimate the 
parameters separately for the two experiments. In other cases a strong prior on 6 
would express the assumption that the underlying parameters do not differ much 
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between groups. Seen from a different perspective, the new experimental condi- 
tion is considered as a biased observation of an already observed experimental 
condition, which goes back to Pocock (1976). 

Finally, we formulated our approach by extending an existing model. That is, we 
added a term to the log-likelihood of the original model. This term represents the 
information from the external means. We used a nested simulation scheme which 
we ran during the MCMC fit. The key step to perform the nested simulation scheme 
was to generate a sufficiently large sample of random numbers prior to the MCMC 
run and to then use this sample for each iteration of the running MCMC to perform 
effectively a Monte Carlo integration. We expect this nested integration approach 
to be useful in general, since its applicability is not restricted to the presented 
application of marginalizing the likelihood over a latent variable space, but can be 
applied in general during a MCMC run. 

Our proposed approach is an approximate solution with respect to the alterna- 
tive approach, which is to represent the patient-level data of the external data set as 
latent. As our simulation studies have revealed, we are still able to obtain accurate 
estimates of the 6 parameter vector, which is our main objective here. The reason 
is the large sample size of the external data, which ensures that the assumption of 
the CLT holds well. The use of our approximate procedure does lead to a reduction 
of computational resources needed to integrate the external average data. Thus, we 
can then use these freed-up computational resources to model more accurately the 
patient-level data and obtain in return better predictions. As external data sets of 
interest are usually of considerable sample size, we expect this to be an advanta- 
geous choice to spend our finite computational resources in these applications. 

Considering our idea more generally, we have effectively reversed the common 
Bayesian approach in which external data are commonly used to elicit a prior, 
which is then updated with experimental data through the model likelihood. In our 
approach, this paradigm is conceptually reversed. The external data is explicitly 
made part of the model likelihood, which then informs our parameters of interest. 
In this light, we expect that our ideas will allow for future developments of general 
interest, such as the formulation of implicit priors or the definition of an effective 
sample size for complex models using a normal approximation. 

In this work we have expanded the applicability of Bayesian meta-analysis to 
the broad class of nonlinear hierarchical models for the case whenever we wish 
to learn from aggregated average data, which renders data from individuals latent 
and only indirectly reported via means. This situation often times arises in the 
domain of biostatistics which uses meta-analytic approaches in various stages of 
drug development. However, the ideas presented are general and should also find 
application in other domains. For our specific case this work enabled accurate 
clinical trial simulations which supported the design of large phase III trials aiming 
to establish better treatments in wetAMD. 
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SUPPLEMENTARY MATERIAL 


Supplement: Program sources (DOI: 10.1214/17-AOAS1122SUPP; .zip). 
Source code of R and Stan programs of simulation studies and drug-disease model. 
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